Predictive Policing - Technical Implementation

MUSA 5080 - Fall 2025

Author

Jinheng

Published

November 3, 2001

About This Exercise

In this exercise, you will build a spatial predictive model for burglaries using count regression and spatial features.

Learning Objectives

By the end of this exercise, you will be able to:

  1. Create a fishnet grid for aggregating point-level crime data
  2. Calculate spatial features including k-nearest neighbors and distance measures
  3. Diagnose spatial autocorrelation using Local Moran’s I
  4. Fit and interpret Poisson and Negative Binomial regression for count data
  5. Implement spatial cross-validation (Leave-One-Group-Out)
  6. Compare model predictions to a Kernel Density Estimation baseline
  7. Evaluate model performance using appropriate metrics

Setup

Code
# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals
library(here)

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

cat("✓ All packages loaded successfully!\n")
✓ All packages loaded successfully!
Code
cat("✓ Working directory:", getwd(), "\n")
✓ Working directory: /Users/jinhengcen/Documents/25fall/PPA/portfolio-setup-CenJinHeng/assignments/assignment_4 

Part 1: Load and Explore Data

Exercise 1.1: Load Chicago Spatial Data

Code
# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 25 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')
Reading layer `chicagoBoundary' from data source 
  `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84
Code
cat("✓ Loaded spatial boundaries\n")
✓ Loaded spatial boundaries
Code
cat("  - Police districts:", nrow(policeDistricts), "\n")
  - Police districts: 25 
Code
cat("  - Police beats:", nrow(policeBeats), "\n")
  - Police beats: 277 
NoteCoordinate Reference System

We’re using ESRI:102271 (Illinois State Plane East, NAD83, US Feet). This is appropriate for Chicago because:

  • It minimizes distortion in this region
  • Uses feet (common in US planning)
  • Allows accurate distance calculations

Exercise 1.2: Load Burglary Data

Code
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read("data/burglary.shp") %>% 
  st_transform('ESRI:102271')
Reading layer `burglary' from data source 
  `/Users/jinhengcen/Documents/25fall/PPA/portfolio-setup-CenJinHeng/assignments/assignment_4/data/burglary.shp' 
  using driver `ESRI Shapefile'
replacing null geometries with empty geometries
Simple feature collection with 152664 features and 16 fields (with 28 geometries empty)
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -87.90451 ymin: 41.64537 xmax: -87.52456 ymax: 42.02253
Geodetic CRS:  WGS 84
Code
# Check the data
cat("\n✓ Loaded burglary data\n")

✓ Loaded burglary data
Code
cat("  - Number of burglaries:", nrow(burglaries), "\n")
  - Number of burglaries: 152664 
Code
cat("  - CRS:", st_crs(burglaries)$input, "\n")
  - CRS: ESRI:102271 
Code
cat("  - Date range:", min(burglaries$date, na.rm = TRUE), "to", 
    max(burglaries$date, na.rm = TRUE), "\n")
  - Date range: Inf to -Inf 

Question 1.1: How many burglaries are in the dataset? What time period does this cover? Why does the coordinate reference system matter for our spatial analysis?

Your answer here:

  • There are 152664 burglaries in my data set (19734 in 2017). I chose Sanitation Code Complaints.

  • This data set covers period between 2000 and 2018.

  • Because the Earth is curved in the reality, yet we attempt to analyze space on a flat xy plane. We use projection to define this transformation process. Due to differences in projection methods, the data points also vary in location. Therefore, we must employ a consistent projection method to ensure the accuracy of data positioning. We also need to determine the appropriate projection method based on different geographic locations and the requirements for calculating distances.

WarningCritical Pause #1: Data Provenance

Before proceeding, consider where this data came from:

Who recorded this data? Chicago Police Department officers and detectives

What might be missing?

  • Unreported burglaries (victims didn’t call police)
  • Incidents police chose not to record
  • Downgraded offenses (burglary recorded as trespassing)
  • Spatial bias (more patrol = more recorded crime)

Think About Was there a Department of Justice investigation of CPD during this period? What did they find about data practices?

Exercise 1.3: Visualize Point Data

Code
library(dplyr)
library(sf)

burglaries <- burglaries %>%
  mutate(Creation.D = as.Date(Creation.D, format = "%m/%d/%Y")) %>%
  filter(format(Creation.D, "%Y") == "2017")


# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )

Question 1.2: What spatial patterns do you observe? Are burglaries evenly distributed across Chicago? Where are the highest concentrations? What might explain these patterns?

Your answer here:

  • Burglaries are not evenly distributed. There are two clusters in the density map. Lincoln Park and South Side are the highest concentrations (told from Google map). Since I am not familiar with Chicago, I just assumed that these are suburban neighbors that have lower revenue that leads to poorly maintenance, so they have more sanitation reports.

Part 2: Create Fishnet Grid

Exercise 2.1: Understanding the Fishnet

A fishnet grid converts irregular point data into a regular grid of cells where we can:

  • Aggregate counts
  • Calculate spatial features
  • Apply regression models

Think of it as overlaying graph paper on a map.

Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
cat("✓ Created fishnet grid\n")
✓ Created fishnet grid
Code
cat("  - Number of cells:", nrow(fishnet), "\n")
  - Number of cells: 2458 
Code
cat("  - Cell size:", 500, "x", 500, "meters\n")
  - Cell size: 500 x 500 meters
Code
cat("  - Cell area:", round(st_area(fishnet[1,])), "square meters\n")
  - Cell area: 250000 square meters

Question 2.1: Why do we use a regular grid instead of existing boundaries like neighborhoods or census tracts? What are the advantages and disadvantages of this approach?

Your answer here:

  • We need fishnet to covert data points into areas that can be calculated for our modeling
  • Fishnet has a consistent size, so there are no boundary bias. It is a standard approach commonly used in policy analysis. It is easy to create and aggregate point data.
  • Howerver, it might split natural areas.

Exercise 2.2: Aggregate Burglaries to Grid

Code
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary statistics
cat("\nBurglary count distribution:\n")

Burglary count distribution:
Code
summary(fishnet$countBurglaries)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    5.00    8.02   12.00  189.00 
Code
cat("\nCells with zero burglaries:", 
    sum(fishnet$countBurglaries == 0), 
    "/", nrow(fishnet),
    "(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n")

Cells with zero burglaries: 576 / 2458 ( 23.4 %)
Code
# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

Question 2.2: What is the distribution of burglary counts across cells? Why do so many cells have zero burglaries? Is this distribution suitable for count regression? (Hint: look up overdispersion)

Your answer here:

  • There are a few cells having very high amount. Cells with high amount concentrated in certain areas, reflecting the first law of geography.

  • Zero burglaries cells are in natural areas like parks, rivers

  • Our assumption is that variance = mean, however, the reality is that Variance > Mean, cuz we have a lot cells counted zero, making the distribution not suitable for our regression.

Part 3: Create a Kernel Density Baseline

Before building complex models, let’s create a simple baseline using Kernel Density Estimation (KDE).

The KDE baseline asks: “What if crime just happens where it happened before?” (simple spatial smoothing, no predictors)

Code
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

cat("✓ Calculated KDE baseline\n")
✓ Calculated KDE baseline
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

Question 3.1: How does the KDE map compare to the count map? What does KDE capture well? What does it miss?

Your answer here:

  • It better reflects regional characteristics across multiple grids that the count map, and it smooths out outliers.

  • It lost the details in local level.

TipWhy Start with KDE?

The KDE represents our null hypothesis: burglaries happen where they happened before, with no other information.

Your complex model must outperform this simple baseline to justify its complexity.

We’ll compare back to this at the end!

Part 4: Create Spatial Predictor Variables

Now we’ll create features that might help predict burglaries. We’ll use “broken windows theory” logic: signs of disorder predict crime.

Exercise 4.1: Load 311 Abandoned Vehicle Calls

Code
abandoned_cars <- read_csv("data/abandoned_cars.csv")%>%
  filter(!is.na(Latitude), !is.na(Longitude)) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform('ESRI:102271')

abandoned_cars_2017 <- abandoned_cars %>%
  mutate(`Creation Date` = as.Date(`Creation Date`, format = "%m/%d/%Y")) %>%
  filter(format(`Creation Date`, "%Y") == "2017")

cat("✓ Loaded abandoned vehicle calls\n")
✓ Loaded abandoned vehicle calls
Code
cat("  - Number of calls:", nrow(abandoned_cars), "\n")
  - Number of calls: 260557 
NoteData Loading Note

The data was downloaded from Chicago’s Open Data Portal. You can now request an api from the Chicago portal and tap into the data there.

Consider: How might the 311 reporting system itself be biased? Who calls 311? What neighborhoods have better 311 awareness?

Exercise 4.2: Count of Abandoned Cars per Cell

Code
# Aggregate abandoned car calls to fishnet
abandoned_fishnet <- st_join(abandoned_cars, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(abandoned_cars = n())

# Join to fishnet
fishnet <- fishnet %>%
  left_join(abandoned_fishnet, by = "uniqueID") %>%
  mutate(abandoned_cars = replace_na(abandoned_cars, 0))

cat("Abandoned car distribution:\n")
Abandoned car distribution:
Code
summary(fishnet$abandoned_cars)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    21.0    81.0   105.7   161.0   890.0 
Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abandoned_cars), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Abandoned Vehicle 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 +
  plot_annotation(title = "Are abandoned cars and burglaries correlated?")

Question 4.1: Do you see a visual relationship between abandoned cars and burglaries? What does this suggest?

Your answer here:

  • The spatial pattern is not obviously visible in Burglaries map due to some very high values. Spatial pattern is not perfectly matched in these two maps: The northwestern region has a large number of abandoned vehicles, yet burglaries are rarely reported in these areas. Same situation for the mid-western areas.

  • These maps suggest that abandoned car and burglaries might be strongly correlated, which should be alarmed for our future model building.

Exercise 4.3: Nearest Neighbor Features

Count in a cell is one measure. Distance to the nearest 3 abandoned cars captures local context.

Code
# Calculate mean distance to 3 nearest abandoned cars
# (Do this OUTSIDE of mutate to avoid sf conflicts)

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
abandoned_coords <- st_coordinates(abandoned_cars)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(abandoned_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    abandoned_cars.nn = rowMeans(nn_result$nn.dist)
  )

cat("✓ Calculated nearest neighbor distances\n")
✓ Calculated nearest neighbor distances
Code
summary(fishnet$abandoned_cars.nn)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   2.619   33.279   57.077  140.849  143.768 1621.235 

Question 4.2: What does a low value of abandoned_cars.nn mean? A high value? Why might this be informative?

Your answer here:

Exercise 4.4: Distance to Hot Spots

Let’s identify clusters of abandoned cars using Local Moran’s I, then calculate distance to these hot spots.

Code
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to abandoned cars
fishnet <- calculate_local_morans(fishnet, "abandoned_cars", k = 5)
Code
# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Abandoned Car Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

Code
# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  
  cat("✓ Calculated distance to abandoned car hot spots\n")
  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}
✓ Calculated distance to abandoned car hot spots
  - Number of hot spot cells: 373 

Question 4.3: Why might distance to a cluster of abandoned cars be more informative than distance to a single abandoned car? What does Local Moran’s I tell us?

Your answer here:

  • it helps us to capture spatial dependence at regional level. (we have calculate the spatial dependence at local and neighborhood levels). It omits the influence of outliers and captures spillover effects, which improves prediction accuracy.

  • Abandoned cars are concentrated in the northwest, while there are fewer abandoned vehicles in the southeast.

Note

Local Moran’s I identifies:

  • High-High: Hot spots (high values surrounded by high values)
  • Low-Low: Cold spots (low values surrounded by low values)
  • High-Low / Low-High: Spatial outliers

This helps us understand spatial clustering patterns.


Part 5: Join Police Districts for Cross-Validation

We’ll use police districts for our spatial cross-validation.

Code
# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("✓ Joined police districts\n")
✓ Joined police districts
Code
cat("  - Districts:", length(unique(fishnet$District)), "\n")
  - Districts: 22 
Code
cat("  - Cells:", nrow(fishnet), "\n")
  - Cells: 1708 

Part 6: Model Fitting

Exercise 6.1: Poisson Regression

Burglary counts are count data (0, 1, 2, 3…). We’ll use Poisson regression.

Code
# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,
    abandoned_cars,
    abandoned_cars.nn,
    dist_to_hotspot
  ) %>%
  na.omit()  # Remove any remaining NAs

cat("✓ Prepared modeling data\n")
✓ Prepared modeling data
Code
cat("  - Observations:", nrow(fishnet_model), "\n")
  - Observations: 1708 
Code
cat("  - Variables:", ncol(fishnet_model), "\n")
  - Variables: 6 
Code
# Fit Poisson regression
model_poisson <- glm(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

# Summary
summary(model_poisson)

Call:
glm(formula = countBurglaries ~ abandoned_cars + abandoned_cars.nn + 
    dist_to_hotspot, family = "poisson", data = fishnet_model)

Coefficients:
                      Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)        2.777848654  0.026930547 103.149 < 0.0000000000000002 ***
abandoned_cars     0.000575490  0.000095061   6.054 0.000000001413971234 ***
abandoned_cars.nn -0.009503078  0.000248184 -38.290 < 0.0000000000000002 ***
dist_to_hotspot   -0.000034234  0.000004185  -8.180 0.000000000000000284 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 17204  on 1707  degrees of freedom
Residual deviance: 11675  on 1704  degrees of freedom
AIC: 17173

Number of Fisher Scoring iterations: 6

Question 6.1: Interpret the coefficients. Which variables are significant? What do the signs (positive/negative) tell you?

Your answer here:

  • Variables that are significant: abandoned_cars, abandoned_cars.nn, dist_to_hotspot

  • The positive coefficient for abandoned_cars indicates that areas with more abandoned cars tend to experience higher burglary counts (sanitation reports). 

  • The negative coefficient for abandoned_cars.nn suggests that abandoned cars located closer together are associated with higher burglary counts.

  • The negative coefficient for dist_to_hotspot indicates that higher near clusters of abandoned cars (hotspots) are associated with higher burglary counts.

Exercise 6.2: Check for Overdispersion

Poisson regression assumes mean = variance. Real count data often violates this (overdispersion).

Code
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")
Dispersion parameter: 9.84 
Code
cat("Rule of thumb: >1.5 suggests overdispersion\n")
Rule of thumb: >1.5 suggests overdispersion
Code
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}
⚠ Overdispersion detected! Consider Negative Binomial model.

Exercise 6.3: Negative Binomial Regression

If overdispersed, use Negative Binomial regression (more flexible).

Code
# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Summary
summary(model_nb)

Call:
glm.nb(formula = countBurglaries ~ abandoned_cars + abandoned_cars.nn + 
    dist_to_hotspot, data = fishnet_model, init.theta = 1.460291188, 
    link = log)

Coefficients:
                     Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)        2.89707098  0.07360773  39.358 < 0.0000000000000002 ***
abandoned_cars     0.00060609  0.00028418   2.133              0.03295 *  
abandoned_cars.nn -0.01183158  0.00052076 -22.720 < 0.0000000000000002 ***
dist_to_hotspot   -0.00003413  0.00001049  -3.252              0.00114 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.4603) family taken to be 1)

    Null deviance: 3126.4  on 1707  degrees of freedom
Residual deviance: 1779.6  on 1704  degrees of freedom
AIC: 10053

Number of Fisher Scoring iterations: 1

              Theta:  1.4603 
          Std. Err.:  0.0619 

 2 x log-likelihood:  -10042.8250 
Code
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")

Model Comparison:
Code
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
Poisson AIC: 17173.2 
Code
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
Negative Binomial AIC: 10052.8 

Question 6.2: Which model fits better (lower AIC)? What does this tell you about the data?

Your answer here:

  • Negative Binomial has lower AIC, which means a better performance in predicting crime.

  • This suggests that the count data are overdispersed (the variance of burglaries is much greater than their mean, which violates the Poisson assumption.

Part 7: Spatial Cross-Validation

Standard cross-validation randomly splits data. But with spatial data, this means training on cells right next to test cells (information leakage!).

Leave-One-Group-Out (LOGO) Cross-Validation trains on all districts except one, then tests on the held-out district.

Code
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

cat("Running LOGO Cross-Validation...\n")
Running LOGO Cross-Validation...
Code
for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ abandoned_cars + abandoned_cars.nn + 
      dist_to_hotspot,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
  cat("  Fold", i, "/", length(districts), "- District", test_district, 
      "- MAE:", round(mae, 2), "\n")
}
  Fold 1 / 22 - District 5 - MAE: 3.66 
  Fold 2 / 22 - District 4 - MAE: 3.29 
  Fold 3 / 22 - District 22 - MAE: 4.84 
  Fold 4 / 22 - District 6 - MAE: 6.57 
  Fold 5 / 22 - District 8 - MAE: 6.75 
  Fold 6 / 22 - District 7 - MAE: 8.39 
  Fold 7 / 22 - District 3 - MAE: 5.97 
  Fold 8 / 22 - District 2 - MAE: 5.14 
  Fold 9 / 22 - District 9 - MAE: 5.16 
  Fold 10 / 22 - District 10 - MAE: 4.63 
  Fold 11 / 22 - District 1 - MAE: 4.56 
  Fold 12 / 22 - District 12 - MAE: 5.24 
  Fold 13 / 22 - District 15 - MAE: 4.86 
  Fold 14 / 22 - District 11 - MAE: 8.34 
  Fold 15 / 22 - District 18 - MAE: 10.12 
  Fold 16 / 22 - District 25 - MAE: 7.66 
  Fold 17 / 22 - District 14 - MAE: 9.19 
  Fold 18 / 22 - District 19 - MAE: 10.81 
  Fold 19 / 22 - District 16 - MAE: 7.74 
  Fold 20 / 22 - District 17 - MAE: 8.04 
  Fold 21 / 22 - District 20 - MAE: 7.25 
  Fold 22 / 22 - District 24 - MAE: 9.59 
Code
# Overall results
cat("\n✓ Cross-Validation Complete\n")

✓ Cross-Validation Complete
Code
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
Mean MAE: 6.72 
Code
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
Mean RMSE: 9.94 
Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District
fold test_district n_test mae rmse
18 19 63 10.81 14.21
15 18 30 10.12 15.13
22 24 41 9.59 13.25
17 14 46 9.19 12.40
6 7 52 8.39 11.89
14 11 43 8.34 11.38
20 17 82 8.04 17.58
19 16 129 7.74 9.01
16 25 85 7.66 9.21
21 20 30 7.25 11.04
5 8 197 6.75 10.76
4 6 63 6.57 9.43
7 3 43 5.97 7.89
12 12 73 5.24 7.08
9 9 107 5.16 6.81
8 2 56 5.14 6.79
13 15 32 4.86 6.60
3 22 112 4.84 6.77
10 10 63 4.63 6.33
11 1 28 4.56 6.45
1 5 98 3.66 6.37
2 4 235 3.29 12.34

Question 7.1: Why is spatial CV more appropriate than random CV for this problem? Which districts were hardest to predict?

Your answer here:

  • problems for CV here include:

    • Observations that are geographically close tend to be spatially correlated

    • the training set may include areas directly adjacent to those in the test set

    • creates spatial leakage, where the model indirectly learns from locations it is supposed to predict, so the model’s performance metrics appear overly optimistic

  • So we need spatial CV to solve these errors.

NoteConnection to Week 5

Remember learning about train/test splits and cross-validation? This is a spatial version of that concept!

Why it matters: If we can only predict well in areas we’ve already heavily policed, what does that tell us about the model’s utility?

Part 8: Model Predictions and Comparison

Exercise 8.1: Generate Final Predictions

Code
# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ abandoned_cars + abandoned_cars.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

Exercise 8.2: Compare Model vs. KDE Baseline

Code
# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Code
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison
approach mae rmse
model 6.06 10.29
kde 5.19 9.12

Question 8.1: Does the complex model outperform the simple KDE baseline? By how much? Is the added complexity worth it?

Your answer here:

  • Sadly, the complex model perform worse than the KDE baseline. RMSE for KDE is 9.12, while RMSE for the complex model is 10.29.

  • Sanitation report is not so strongly correlated with abandoned car. Next step might include trying another predictor, adding more predictor, or changing the spatial feature.

Exercise 9.3: Where Does the Model Work Well?

Code
# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

Question 9.2: Where does the model make the biggest errors? Are there spatial patterns in the errors? What might this reveal?

Your answer here:

  • Model errors perform randomly in their spatial distribution. There are a few cells recorded extremely high value in absolute model errors, however, they are randomly dispersed.

  • This indicates that the model’s performance was not affected by spatial correlations.

Part 10: Summary Statistics and Tables

Exercise 10.1: Model Summary Table

Code
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )
Final Negative Binomial Model Coefficients (Exponentiated)
Variable Rate Ratio Std. Error Z P-Value
(Intercept) 18.121 0.074 39.358 0.000
abandoned_cars 1.001 0.000 2.133 0.033
abandoned_cars.nn 0.988 0.001 -22.720 0.000
dist_to_hotspot 1.000 0.000 -3.252 0.001
Note:
Rate ratios > 1 indicate positive association with burglary counts.

Exercise 10.2: Key Findings Summary

Based on your analysis, complete this summary:

Technical Performance:

  • Cross-validation MAE: 6.06
  • Model vs. KDE: KDE performs better
  • Most predictive variable: abandoned_cars.nn

Spatial Patterns:

  • Burglaries are clustered
  • Hot spots are located in suburb areas on the northwestern and southern side
  • Model errors show random patterns

Model Limitations:

  • Overdispersion: Yes
  • Spatial autocorrelation in residuals: No, model error are randomly distributed.
  • Cells with zero counts: 23% of the data are 0

Conclusion and Next Steps

ImportantWhat You’ve Accomplished

You’ve successfully built a spatial predictive model for burglaries using:

✓ Fishnet aggregation
✓ Spatial features (counts, distances, nearest neighbors)
✓ Spatial autocorrelation diagnostics (Local Moran’s I)
✓ Count regression (Poisson and Negative Binomial)
✓ Spatial cross-validation (LOGO)
✓ Comparison to baseline (KDE)

::::::::::